2025-05-28-12-07
xChemAgents: Agentic AI for Explainable Quantum Chemistry
Abstract
arXiv:2505.20574v1 Announce Type: new Abstract: Recent progress in multimodal graph neural networks has demonstrated that augmenting atomic XYZ geometries with textual chemical descriptors can enhance predictive accuracy across a range of electronic and thermodynamic properties. However, naively appending large sets of heterogeneous descriptors often degrades performance on tasks sensitive to molecular shape or symmetry, and undermines interpretability. xChemAgents proposes a cooperative agent framework that injects physics-aware reasoning into multimodal property prediction. xChemAgents comprises two language-model-based agents: a Selector, which adaptively identifies a sparse, weighted subset of descriptors relevant to each target and provides a natural-language rationale; and a Validator, which enforces physical constraints such as unit consistency and scaling laws through iterative dialogue. On standard benchmark datasets, xChemAgents achieves up to a 22% reduction in mean absolute error over strong baselines, while producing faithful, human-interpretable explanations. Experimental results highlight the potential of cooperative, self-verifying agents to enhance both accuracy and transparency in foundation-model-driven materials science. The implementation and accompanying dataset are available anonymously at https://github.com/KurbanIntelligenceLab/xChemAgents.
摘要
多模态图神经网络的最新进展表明,通过将原子XYZ几何结构与文本化学描述符相结合,可以提高对多种电子和热力学性质的预测准确性。然而,简单地附加大量异构描述符往往会降低对分子形状或对称性敏感任务的性能,并损害可解释性。xChemAgents提出了一种协作代理框架,将物理感知推理注入多模态性质预测中。xChemAgents包含两个基于语言模型的代理:选择器(Selector)自适应地识别与每个目标相关的稀疏加权描述符子集,并提供自然语言依据;验证器(Validator)通过迭代对话强制执行物理约束,如单位一致性和标度律。在标准基准数据集上,xChemAgents相较于强基线实现了高达22%的平均绝对误差降低,同时生成忠实、人类可解释的说明。实验结果凸显了协作自验证代理在提升基础模型驱动材料科学的准确性和透明度方面的潜力。实现代码及配套数据集可通过匿名链接https://github.com/KurbanIntelligenceLab/xChemAgents获取。
Manalyzer: End-to-end Automated Meta-analysis with Multi-agent System
Abstract
arXiv:2505.20310v1 Announce Type: new Abstract: Meta-analysis is a systematic research methodology that synthesizes data from multiple existing studies to derive comprehensive conclusions. This approach not only mitigates limitations inherent in individual studies but also facilitates novel discoveries through integrated data analysis. Traditional meta-analysis involves a complex multi-stage pipeline including literature retrieval, paper screening, and data extraction, which demands substantial human effort and time. While LLM-based methods can accelerate certain stages, they still face significant challenges, such as hallucinations in paper screening and in data extraction. In this paper, we propose a multi-agent system, Manalyzer, which achieves end-to-end automated meta-analysis through tool calls. The hybrid review, hierarchical extraction, self-proving, and feedback checking strategies implemented in Manalyzer significantly alleviate both kinds of hallucination. To comprehensively evaluate the performance of meta-analysis, we construct a new benchmark comprising 729 papers across 3 domains, encompassing text, image, and table modalities, with over 10,000 data points. Extensive experiments demonstrate that Manalyzer achieves significant performance improvements over the LLM baseline across multiple meta-analysis tasks. Project page: https://black-yt.github.io/meta-analysis-page/ .
摘要
元分析是一种系统性研究方法,通过整合多个现有研究的数据以得出综合结论。这种方法不仅能减轻单个研究固有的局限性,还能通过集成数据分析促进新发现。传统元分析涉及文献检索、论文筛选和数据提取等复杂多阶段流程,需要耗费大量人力与时间。尽管基于大语言模型的方法能加速某些环节,但仍面临重大挑战,例如论文筛选和数据提取中的幻觉问题。本文提出多智能体系统Manalyzer,通过工具调用实现端到端自动化元分析。该系统采用的混合评审、分层提取、自证与反馈校验策略显著缓解了上述两类幻觉问题。为全面评估元分析性能,我们构建了包含3个领域(文本、图像和表格模态)729篇论文的新基准数据集,涵盖超10,000个数据点。大量实验表明,Manalyzer在多类元分析任务中较基线大语言模型实现了显著性能提升。项目页面:https://black-yt.github.io/meta-analysis-page/。
Project Riley: Multimodal Multi-Agent LLM Collaboration with Emotional Reasoning and Voting
Abstract
arXiv:2505.20521v1 Announce Type: new Abstract: This paper presents Project Riley, a novel multimodal and multi-model conversational AI architecture oriented towards the simulation of reasoning influenced by emotional states. Drawing inspiration from Pixar's Inside Out, the system comprises five distinct emotional agents - Joy, Sadness, Fear, Anger, and Disgust - that engage in structured multi-round dialogues to generate, criticise, and iteratively refine responses. A final reasoning mechanism synthesises the contributions of these agents into a coherent output that either reflects the dominant emotion or integrates multiple perspectives. The architecture incorporates both textual and visual large language models (LLMs), alongside advanced reasoning and self-refinement processes. A functional prototype was deployed locally in an offline environment, optimised for emotional expressiveness and computational efficiency. From this initial prototype, another one emerged, called Armando, which was developed for use in emergency contexts, delivering emotionally calibrated and factually accurate information through the integration of Retrieval-Augmented Generation (RAG) and cumulative context tracking. The Project Riley prototype was evaluated through user testing, in which participants interacted with the chatbot and completed a structured questionnaire assessing three dimensions: Emotional Appropriateness, Clarity and Utility, and Naturalness and Human-likeness. The results indicate strong performance in structured scenarios, particularly with respect to emotional alignment and communicative clarity.
摘要
本文介绍了莱利项目(Project Riley),一种新型多模态多模型对话式人工智能架构,旨在模拟受情绪状态影响的推理过程。受皮克斯电影《头脑特工队》启发,该系统由五个独立的情感代理(快乐、悲伤、恐惧、愤怒和厌恶)组成,这些代理通过结构化多轮对话进行回答生成、批评和迭代优化。最终推理机制将这些代理的贡献综合为连贯输出,既可体现主导情绪,也能整合多元观点。该架构整合了文本与视觉大语言模型(LLMs),并采用先进的推理与自我优化流程。我们在离线环境中部署了功能原型,针对情感表达能力和计算效率进行了优化。基于该原型衍生出应急场景专用版本Armando,通过检索增强生成(RAG)和累积上下文追踪技术,提供情绪适配且事实准确的信息。莱利项目原型通过用户测试进行评估,参与者与聊天机器人交互后完成结构化问卷,从三个维度进行测评:情绪适配性、清晰度与实用性、自然度与拟人性。结果显示在结构化场景中表现优异,尤其在情绪匹配和沟通清晰度方面。
SCAR: Shapley Credit Assignment for More Efficient RLHF
Abstract
arXiv:2505.20417v1 Announce Type: new Abstract: Reinforcement Learning from Human Feedback (RLHF) is a widely used technique for aligning Large Language Models (LLMs) with human preferences, yet it often suffers from sparse reward signals, making effective credit assignment challenging. In typical setups, the reward model provides a single scalar score for an entire generated sequence, offering little insight into which token or span-level decisions were responsible for the outcome. To address this, we propose Shapley Credit Assignment Rewards (SCAR), a novel method that leverages Shapley values in cooperative game theory. SCAR distributes the total sequence-level reward among constituent tokens or text spans based on their principled marginal contributions. This creates dense reward signals, crucially, without necessitating the training of auxiliary critique models or recourse to fine-grained human annotations at intermediate generation stages. Unlike prior dense reward methods, SCAR offers a game-theoretic foundation for fair credit attribution. Theoretically, we demonstrate that SCAR preserves the original optimal policy, and empirically, across diverse tasks including sentiment control, text summarization, and instruction tuning, we show that SCAR converges significantly faster and achieves higher final reward scores compared to standard RLHF and attention-based dense reward baselines. Our findings suggest that SCAR provides a more effective and theoretically sound method for credit assignment in RLHF, leading to more efficient alignment of LLMs.
摘要
基于人类反馈的强化学习(RLHF)是一种广泛使用的技术,用于将大型语言模型(LLMs)与人类偏好对齐,但其常受稀疏奖励信号的困扰,导致有效的信用分配具有挑战性。在典型设置中,奖励模型仅为整个生成序列提供单一标量分数,难以揭示哪些词元或片段级决策对结果产生了影响。为解决这一问题,我们提出夏普利信用分配奖励(SCAR),这是一种利用合作博弈论中夏普利值的新方法。SCAR基于各成分词元或文本片段的边际贡献,将序列级总奖励按原则性分配。这一方法创造了密集的奖励信号,且关键无需训练辅助评论模型或依赖中间生成阶段的细粒度人工标注。与现有密集奖励方法不同,SCAR为公平信用归因提供了博弈论基础。理论上,我们证明SCAR保留了原始最优策略;实证上,在情感控制、文本摘要和指令调优等多样化任务中,相较于标准RLHF和基于注意力的密集奖励基线,SCAR收敛速度显著更快且最终奖励分数更高。我们的研究结果表明,SCAR为RLHF中的信用分配提供了一种更有效且理论可靠的方法,从而实现了LLMs更高效的对齐。
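The Shapley-based redistribution described in the SCAR abstract above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `toy_reward` function is a hypothetical stand-in for a reward model scored on partial text, the span names are invented, and exact enumeration over all coalitions is used only because the example is tiny (it is exponential in the number of spans, so a practical method would need an approximation).

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values; `players` must be unique, `value` maps a frozenset to a float."""
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for k in range(n):
            # Weight of a coalition of size k that does not contain p.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal contribution of p when it joins coalition s.
                phi[p] += weight * (value(s | {p}) - value(s))
    return phi


def toy_reward(spans):
    """Hypothetical sequence-level reward (assumption, stands in for a reward model)."""
    score = 0.0
    if "great" in spans:
        score += 1.0
    if "movie" in spans:
        score += 0.2
    if "great" in spans and "movie" in spans:
        score += 0.3  # interaction term, split equally by the Shapley rule
    return score


spans = ["the", "movie", "was", "great"]
dense_rewards = shapley_values(spans, toy_reward)
total_reward = toy_reward(frozenset(spans))
```

By the efficiency property of Shapley values, the per-span rewards sum to the sequence-level reward (minus the reward of the empty set, which is zero here), so the dense signal redistributes the original scalar rather than adding to it.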
Scaling over Scaling: Exploring Test-Time Scaling Pareto in Large Reasoning Models
Abstract
arXiv:2505.20522v1 Announce Type: new Abstract: Large reasoning models (LRMs) have exhibited the capacity of enhancing reasoning performance via internal test-time scaling. Building upon this, a promising direction is to further scale test-time compute to unlock even greater reasoning capabilities. However, as we push these scaling boundaries, systematically understanding the practical limits and achieving optimal resource allocation becomes a critical challenge. In this paper, we investigate the scaling Pareto of test-time scaling and introduce the Test-Time Scaling Performance Model (TTSPM). We theoretically analyze two fundamental paradigms for such extended scaling, parallel scaling and sequential scaling, from a probabilistic modeling perspective. Our primary contribution is the derivation of the saturation point on the scaling budget for both strategies, identifying thresholds beyond which additional computation yields diminishing returns. Remarkably, despite their distinct mechanisms, both paradigms converge to a unified mathematical structure in their upper bounds. We empirically validate our theoretical findings on challenging reasoning benchmarks, including AIME, MATH-500, and GPQA, demonstrating the practical utility of these bounds for test-time resource allocation. We hope that this work provides insights into the cost-benefit trade-offs of test-time scaling, guiding the development of more resource-efficient inference strategies for large reasoning models.
摘要
大型推理模型(LRMs)已展现出通过内部测试时扩展提升推理性能的能力。基于此,进一步扩展测试时计算以释放更强推理能力成为具有前景的研究方向。然而,随着扩展边界的不断推进,系统理解实践极限并实现最优资源配置成为关键挑战。本文研究了测试时扩展的帕累托边界,并提出测试时扩展性能模型(TTSPM)。我们从概率建模角度理论分析了两种基本扩展范式——并行扩展与序列扩展。主要贡献在于推导出两种策略在扩展预算上的饱和点,确定了超出该阈值后额外计算将产生收益递减的临界值。值得注意的是,尽管机制不同,这两种范式在其上界处收敛于统一的数学结构。我们在AIME、MATH-500和GPQA等具有挑战性的推理基准上实证验证了理论发现,证明了这些边界对测试时资源分配的实际效用。本研究希望为测试时扩展的成本效益权衡提供见解,指导开发更具资源效率的大型推理模型推断策略。
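To make the idea of a saturation point on the scaling budget concrete, here is a minimal Python sketch for parallel scaling only. It rests on a simplifying assumption that is not taken from the paper: each of `n` independent samples solves the problem with probability `p`, and a perfect selector picks a correct sample whenever one exists, so the success rate is 1 - (1 - p)^n. The function names and the threshold `eps` are illustrative; the paper's TTSPM may define saturation differently.

```python
import math


def parallel_success(p, n):
    """Success probability with n independent samples under the assumed model."""
    return 1.0 - (1.0 - p) ** n


def marginal_gain(p, n):
    """Gain from adding sample n + 1, which equals p * (1 - p) ** n."""
    return parallel_success(p, n + 1) - parallel_success(p, n)


def saturation_budget(p, eps):
    """Smallest n at which one more sample improves success by at most eps."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if eps >= p:
        return 0  # even the first sample gains no more than eps
    # Solve p * (1 - p) ** n <= eps for n; log(1 - p) is negative.
    return math.ceil(math.log(eps / p) / math.log(1.0 - p))


p, eps = 0.2, 0.01
budget = saturation_budget(p, eps)
```

Under this toy model the marginal gain decays geometrically, which is one simple way to see why additional samples past some budget yield diminishing returns.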
CoderAgent: Simulating Student Behavior for Personalized Programming Learning with Large Language Models
Abstract
arXiv:2505.20642v1 Announce Type: new Abstract: Personalized programming tutoring, such as exercise recommendation, can enhance learners' efficiency, motivation, and outcomes, which is increasingly important in modern digital education. However, the lack of sufficient and high-quality programming data, combined with the mismatch between offline evaluation and real-world learning, hinders the practical deployment of such systems. To address this challenge, many approaches attempt to simulate learner practice data, yet they often overlook the fine-grained, iterative nature of programming learning, resulting in a lack of interpretability and granularity. To fill this gap, we propose an LLM-based agent, CoderAgent, to simulate students' programming processes in a fine-grained manner without relying on real data. Specifically, we equip each human learner with an intelligent agent, the core of which lies in capturing the cognitive states of the human programming practice process. Inspired by ACT-R, a cognitive architecture framework, we design the structure of CoderAgent to align with human cognitive architecture by focusing on the mastery of programming knowledge and the application of coding ability. Recognizing the inherent patterns in multi-layered cognitive reasoning, we introduce the Programming Tree of Thought (PTOT), which breaks down the process into four steps: why, how, where, and what. This approach enables a detailed analysis of iterative problem-solving strategies. Finally, experimental evaluations on real-world datasets demonstrate that CoderAgent provides interpretable insights into learning trajectories and achieves accurate simulations, paving the way for personalized programming education.
摘要
个性化编程辅导(如习题推荐)能够提升学习者的效率、动机和成果,这在现代数字教育中日益重要。然而,缺乏充足且高质量的编程数据,加之离线评估与实际学习场景的脱节,阻碍了此类系统的实际部署。为解决这一挑战,现有方法多尝试模拟学习者练习数据,却往往忽视编程学习细粒度、迭代式的本质,导致可解释性与精细度不足。为此,我们提出基于大语言模型的智能体CoderAgent,在不依赖真实数据的前提下细粒度模拟学生编程过程。具体而言,我们为每位人类学习者配备智能代理,其核心在于捕捉人类编程实践过程中的认知状态。受认知架构框架ACT-R启发,我们通过聚焦编程知识掌握与编码能力应用,设计CoderAgent结构以匹配人类认知架构。针对多层认知推理的固有规律,我们提出编程思维树(PTOT),将过程分解为'为何、如何、何处、何为'四个步骤,实现对迭代式问题解决策略的细粒度解析。最终,真实数据集上的实验评估表明,CoderAgent能为学习轨迹提供可解释的洞察,并实现精准模拟,为个性化编程教育铺平道路。
MIRROR: Multi-agent Intra- and Inter-Reflection for Optimized Reasoning in Tool Learning
Abstract
arXiv:2505.20670v1 Announce Type: new Abstract: Complex tasks involving tool integration pose significant challenges for Large Language Models (LLMs), leading to the emergence of multi-agent workflows as a promising solution. Reflection has emerged as an effective strategy for correcting erroneous trajectories in agentic workflows. However, existing approaches only exploit such capability in the post-action stage, where the agent observes the execution outcomes. We argue that, like humans, LLMs can also engage in reflection before action execution: the agent can anticipate undesirable outcomes from its own decisions, which not only provides a necessarily complementary perspective to evaluate the decision but also prevents the propagation of errors throughout the trajectory. In this paper, we propose MIRROR, a framework that consists of both intra-reflection, which critically assesses intended actions before execution, and inter-reflection, which further adjusts the trajectory based on observations. This design systematically leverages LLM reflection capabilities to eliminate and rectify erroneous actions on a more comprehensive scope. Evaluations on both the StableToolBench and TravelPlanner benchmarks demonstrate MIRROR's superior performance, achieving state-of-the-art results compared to existing approaches.
摘要
涉及工具整合的复杂任务对大型语言模型(LLM)提出了重大挑战,这促使多智能体工作流成为一种有前景的解决方案。反思已成为纠正智能体工作流中错误轨迹的有效策略。然而,现有方法仅在行动后阶段利用这种能力,即智能体观察执行结果。我们认为,与人类类似,LLM也可以在行动执行前进行反思:智能体能够预见到自身决策可能产生的不良后果,这不仅为评估决策提供了必要的补充视角,还能防止错误在轨迹中传播。本文提出MIRROR框架,包含执行前批判性评估预期行动的内部反思(intra-reflection)和基于观察进一步调整轨迹的交互反思(inter-reflection)。这一设计系统性地利用LLM的反思能力,在更全面的范围内消除和纠正错误行动。在StableToolBench和TravelPlanner基准测试上的评估表明,MIRROR性能优越,相较于现有方法取得了最先进的结果。
LLM-Guided Reinforcement Learning: Addressing Training Bottlenecks through Policy Modulation
Abstract
arXiv:2505.20671v1 Announce Type: new Abstract: While reinforcement learning (RL) has achieved notable success in various domains, training effective policies for complex tasks remains challenging. Agents often converge to local optima and fail to maximize long-term rewards. Existing approaches to mitigate training bottlenecks typically fall into two categories: (i) Automated policy refinement, which identifies critical states from past trajectories to guide policy updates, but suffers from costly and uncertain model training; and (ii) Human-in-the-loop refinement, where human feedback is used to correct agent behavior, but this does not scale well to environments with large or continuous action spaces. In this work, we design a large language model-guided policy modulation framework that leverages LLMs to improve RL training without additional model training or human intervention. We first prompt an LLM to identify critical states from a sub-optimal agent's trajectories. Based on these states, the LLM then provides action suggestions and assigns implicit rewards to guide policy refinement. Experiments across standard RL benchmarks demonstrate that our method outperforms state-of-the-art baselines, highlighting the effectiveness of LLM-based explanations in addressing RL training bottlenecks.
摘要
尽管强化学习(RL)在多个领域取得了显著成功,但针对复杂任务训练有效策略仍具挑战性。智能体常收敛于局部最优而无法最大化长期奖励。现有缓解训练瓶颈的方法主要分为两类:(1)自动化策略优化,通过从历史轨迹中识别关键状态来指导策略更新,但存在模型训练成本高且效果不确定的问题;(2)人机协同优化,利用人类反馈修正智能体行为,但难以扩展至动作空间庞大或连续的环境。本研究设计了一个大语言模型引导的策略调制框架,利用LLM改进RL训练而无需额外模型训练或人工干预。我们首先提示LLM从次优智能体的轨迹中识别关键状态,随后基于这些状态由LLM提供动作建议并分配隐式奖励以指导策略优化。标准RL基准测试表明,本方法优于现有最优基线,凸显了基于LLM的解释在解决RL训练瓶颈中的有效性。
Reinforcement Speculative Decoding for Fast Ranking
Abstract
arXiv:2505.20316v1 Announce Type: new Abstract: Large Language Models (LLMs) have been widely adopted in ranking systems such as information retrieval (IR) systems and recommender systems (RSs). To alleviate the latency of auto-regressive decoding, some studies explore single (first) token decoding for ranking approximation, but they suffer from severe degradation in tail positions. Although speculative decoding (SD) methods can be a remedy with verification at different positions, they face challenges in ranking systems due to their left-to-right decoding paradigm. First, ranking systems impose strict latency constraints, but the number of verification rounds in SD methods is not known in advance. Second, SD methods usually discard listwise ranking knowledge about unaccepted items in previous rounds, hindering future multi-token prediction, especially when the candidate tokens are those unaccepted items. In this paper, we propose a Reinforcement Speculative Decoding method for fast ranking inference of LLMs. To meet the ranking systems' latency requirement, we propose an up-to-down decoding paradigm that employs an agent to iteratively modify the ranking sequence under a constrained budget. Specifically, we design a ranking-tailored policy optimization that actively explores, via reinforcement learning (RL), the optimal multi-round ranking modification policy verified by LLMs. To better approximate the target LLM under the constrained budget, we have the agent fully utilize, during RL, the listwise ranking knowledge about all items verified by LLMs across different rounds, enhancing the agent's modification policy. More importantly, we demonstrate the theoretical robustness and advantages of our paradigm and implementation. Experiments on both IR and RS tasks show the effectiveness of our proposed method.
摘要
大型语言模型(LLMs)已广泛应用于信息检索(IR)系统和推荐系统(RS)等排序系统中。为缓解自回归解码的延迟问题,现有研究探索采用首单令牌解码进行排序近似,但此类方法在尾部位置存在显著性能退化。虽然推测式解码(SD)方法可通过多位置验证缓解该问题,但其从左至右的解码范式在排序系统中面临挑战:首先,排序系统要求严格的延迟约束,而SD方法的验证轮次具有不可预知性;其次,SD方法通常会丢弃先前轮次中未通过验证项目的列表排序知识,这阻碍了后续多令牌预测,尤其当候选令牌为先前未通过验证项目时。本文提出一种基于强化学习的推测式解码方法,用于LLMs的快速排序推理。为满足排序系统的延迟要求,我们采用自顶向下解码范式,通过智能体在预算约束下迭代修改排序序列。具体而言,我们设计了面向排序的策略优化方法,通过强化学习(RL)主动探索经LLMs验证的最优多轮排序修改策略。为在预算约束下更好逼近目标LLM,我们在RL训练中促使智能体充分利用LLMs跨轮次验证的所有项目的列表排序知识,从而提升其修改策略的有效性。更重要的是,我们从理论上证明了该范式及其实现的鲁棒性与优势。在IR和RS任务上的实验验证了所提方法的有效性。
Comparisons between a Large Language Model-based Real-Time Compound Diagnostic Medical AI Interface and Physicians for Common Internal Medicine Cases using Simulated Patients
Abstract
arXiv:2505.20609v1 Announce Type: new Abstract: Objective: To develop an LLM-based real-time compound diagnostic medical AI interface and to perform a clinical trial comparing this interface with physicians on common internal medicine cases, using United States Medical Licensing Examination (USMLE) Step 2 Clinical Skills (CS) style exams. Methods: A nonrandomized clinical trial was conducted on August 20, 2024. We recruited one general physician, two internal medicine residents (2nd and 3rd year), and five simulated patients. The clinical vignettes were adapted from the USMLE Step 2 CS style exams. We developed 10 representative internal medicine cases based on actual patients and included the information available at the initial diagnostic evaluation. The primary outcome was the accuracy of the first differential diagnosis. Repeatability was evaluated based on the proportion of agreement. Results: The accuracy of the physicians' first differential diagnosis ranged from 50% to 70%, whereas the real-time compound diagnostic medical AI interface achieved an accuracy of 80%. The proportion of agreement for the first differential diagnosis was 0.7. The accuracy of the first and second differential diagnoses ranged from 70% to 90% for physicians, whereas the AI interface achieved an accuracy of 100%. The average time for the AI interface (557 sec) was 44.6% shorter than that of the physicians (1006 sec). The cost of the AI interface (0.08 USD) was 98.1% lower than the average cost of the physicians (4.2 USD). Patient satisfaction scores ranged from 4.2 to 4.3 for care by physicians and were 3.9 for the AI interface. Conclusion: An LLM-based real-time compound diagnostic medical AI interface demonstrated diagnostic accuracy and patient satisfaction comparable to those of a physician, while requiring less time and lower costs. These findings suggest that AI interfaces may have the potential to assist primary care consultations for common internal medicine cases.
摘要
目的 开发基于大型语言模型(LLM)的实时复合诊断医疗AI接口,并通过与美国医师执照考试(USMLE)第二阶段临床技能(CS)考试相似的临床试验,比较该接口与医师对常见内科病例的诊断能力。方法 于2024年8月20日进行非随机临床试验。招募1名全科医师、2名内科住院医师(第2年和第3年)及5名模拟患者。临床案例改编自USMLE Step 2 CS考试。我们基于真实患者开发了10个代表性内科病例,包含初始诊断评估的可用信息。主要结局指标是第一鉴别诊断的准确率,重复性通过诊断一致性比例评估。结果 医师第一鉴别诊断准确率为50%-70%,而实时复合诊断医疗AI接口达到80%。第一诊断一致性比例为0.7。医师第一和第二鉴别诊断准确率为70%-90%,而AI接口达到100%。AI接口平均用时(557秒)较医师(1006秒)缩短44.6%。AI接口成本(0.08美元)较医师平均成本(4.2美元)降低98.1%。患者对医师诊疗满意度评分为4.2-4.3分,AI接口为3.9分。结论 基于LLM的实时复合诊断医疗AI接口展现出与医师相当的诊断准确率和患者满意度,同时具有用时更短、成本更低的优势。这些发现表明AI接口可能具备辅助常见内科病例初级诊疗的潜力。
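The reported time and cost savings follow directly from the raw figures in the abstract above; the cost figures (0.08 USD versus 4.2 USD) are taken as stated in the Chinese abstract. The short Python check below simply recomputes the two percentages; the helper name `percent_reduction` is illustrative.

```python
def percent_reduction(baseline, new):
    """Relative reduction of `new` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - new) / baseline


# Average consultation time in seconds: physicians vs. AI interface.
time_saving = percent_reduction(1006, 557)

# Average cost in USD: physicians vs. AI interface.
cost_saving = percent_reduction(4.2, 0.08)

print(f"time: {time_saving:.1f}% shorter, cost: {cost_saving:.1f}% lower")
```

Both values round to the percentages quoted in the abstract (44.6% and 98.1%).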
RRO: LLM Agent Optimization Through Rising Reward Trajectories
Abstract
arXiv:2505.20737v1 Announce Type: new Abstract: Large language models (LLMs) have exhibited extraordinary performance in a variety of tasks, yet it remains challenging for them to solve complex multi-step tasks as agents. In practice, agents are sensitive to the outcome of certain key steps, which makes them likely to fail a task because of a subtle mistake in the planning trajectory. Recent approaches resort to calibrating the reasoning process through reinforcement learning. They reward or penalize every reasoning step with process supervision, an approach known as Process Reward Models (PRMs). However, PRMs are difficult and costly to scale up to a large number of next-action candidates, since acquiring the training data through per-step trajectory exploration requires extensive computation. To mitigate this issue, we focus on the relative reward trend across successive reasoning steps and propose maintaining an increasing reward in the collected trajectories for process supervision, which we term Reward Rising Optimization (RRO). Specifically, we incrementally augment the process supervision until we identify a step exhibiting a positive reward differential, i.e., a rising reward, relative to its preceding iteration. This method dynamically expands the search space for next-action candidates, efficiently capturing high-quality data. We provide mathematical groundings and empirical results on the WebShop and InterCode-SQL benchmarks, showing that our proposed RRO achieves superior performance while requiring much less exploration cost.
摘要
大型语言模型(LLMs)在多种任务中展现出卓越性能,但作为智能体解决复杂多步骤任务仍具挑战性。实践中,智能体对关键步骤结果高度敏感,规划轨迹中的细微错误极易导致任务失败。现有方法多采用强化学习校准推理过程,通过过程监督对每个推理步骤进行奖励或惩罚(即过程奖励模型PRMs)。然而,由于需通过逐步轨迹探索获取训练数据,PRMs在面临大量候选动作时难以扩展且计算成本高昂。为此,我们聚焦于连续推理步骤间的相对奖励趋势,提出在过程监督中保持收集轨迹的奖励递增,称为奖励上升优化(RRO)。具体而言,我们逐步增强过程监督,直至识别出相对于前次迭代呈现正奖励差异(即奖励上升)的步骤。该方法动态扩展候选动作的搜索空间,高效捕获高质量数据。我们在WebShop和InterCode-SQL基准测试中提供了数学依据和实证结果,表明所提RRO方法在显著降低探索成本的同时实现了更优性能。
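The "keep sampling until the reward rises" idea in the RRO abstract above can be sketched as a small Python loop. This is a sketch under stated assumptions, not the paper's procedure: `propose`, `estimate_reward`, the budget `max_candidates`, and the fallback to the best candidate seen are all hypothetical, and the abstract does not say how the collected candidates are turned into training data.

```python
def rising_reward_step(prev_reward, propose, estimate_reward, max_candidates=8):
    """Sample next-action candidates until one scores above prev_reward.

    Returns (chosen_action, its_reward, all_candidates_tried).
    """
    tried = []
    for _ in range(max_candidates):
        action = propose()
        reward = estimate_reward(action)
        tried.append((action, reward))
        if reward > prev_reward:
            # Rising reward found: stop exploring early.
            return action, reward, tried
    # Budget exhausted (assumed fallback): keep the best candidate seen.
    action, reward = max(tried, key=lambda pair: pair[1])
    return action, reward, tried


# Toy demonstration with a fixed stream of hypothetical candidates.
stream = iter([("a", 0.2), ("b", 0.1), ("c", 0.5), ("d", 0.9)])
scores = {}


def propose():
    name, score = next(stream)
    scores[name] = score
    return name


chosen, reward, tried = rising_reward_step(0.3, propose, scores.__getitem__)
```

In this toy run the loop stops at the third candidate, the first whose reward exceeds the previous step's reward of 0.3, so the fourth candidate is never evaluated. That early stop is the source of the exploration savings the abstract refers to.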
GIFARC: Synthetic Dataset for Leveraging Human-Intuitive Analogies to Elevate AI Reasoning
Abstract
arXiv:2505.20672v1 Announce Type: new Abstract: The Abstraction and Reasoning Corpus (ARC) poses a stringent test of general AI capabilities, requiring solvers to infer abstract patterns from only a handful of examples. Despite substantial progress in deep learning, state-of-the-art models still achieve accuracy rates of merely 40-55% in the 2024 ARC Competition, indicating a significant gap between their performance and human-level reasoning. In this work, we seek to bridge that gap by introducing an analogy-inspired ARC dataset, GIFARC. Leveraging large language models (LLMs) and vision-language models (VLMs), we synthesize new ARC-style tasks from a variety of GIF images that include analogies. Each new task is paired with a ground-truth analogy, providing an explicit mapping between visual transformations and everyday concepts. By embedding robust human-intuitive analogies into ARC-style tasks, GIFARC guides AI agents to evaluate the task analogically before engaging in brute-force pattern search, thus efficiently reducing problem complexity and building a more concise and human-understandable solution. We empirically validate that guiding LLMs with the analogical approach of GIFARC shifts their task-solving strategies toward the analogical approach used by humans.
摘要
抽象与推理语料库(ARC)对通用人工智能能力提出了严格测试,要求求解者仅通过少量示例推断抽象模式。尽管深度学习已取得显著进展,但在2024年ARC竞赛中,最先进模型的准确率仍仅为40-55%,这表明其性能与人类水平推理存在显著差距。本研究通过引入受类比启发的ARC数据集GIFARC来弥合这一差距。我们利用大语言模型(LLMs)和视觉语言模型(VLMs),从包含类比的各类GIF图像中合成新的ARC式任务。每个新任务均配有真实类比,提供视觉变换与日常概念间的显式映射。通过将强健的人类直觉类比嵌入ARC式任务,GIFARC引导智能体在展开暴力模式搜索前进行类比评估,从而有效降低问题复杂度并构建更简洁、人类可理解的解决方案。实证研究表明,采用GIFARC的类比方法引导LLMs会影响其任务解决方式,使其与人类的类比推理方法保持一致。
MSEarth: A Benchmark for Multimodal Scientific Comprehension of Earth Science
Abstract
arXiv:2505.20740v1 Announce Type: new Abstract: The rapid advancement of multimodal large language models (MLLMs) has unlocked new opportunities to tackle complex scientific challenges. Despite this progress, their application in addressing earth science problems, especially at the graduate level, remains underexplored. A significant barrier is the absence of benchmarks that capture the depth and contextual complexity of geoscientific reasoning. Current benchmarks often rely on synthetic datasets or simplistic figure-caption pairs, which do not adequately reflect the intricate reasoning and domain-specific insights required for real-world scientific applications. To address these gaps, we introduce MSEarth, a multimodal scientific benchmark curated from high-quality, open-access scientific publications. MSEarth encompasses the five major spheres of Earth science: atmosphere, cryosphere, hydrosphere, lithosphere, and biosphere, featuring over 7K figures with refined captions. These captions are crafted from the original figure captions and enriched with discussions and reasoning from the papers, ensuring the benchmark captures the nuanced reasoning and knowledge-intensive content essential for advanced scientific tasks. MSEarth supports a variety of tasks, including scientific figure captioning, multiple choice questions, and open-ended reasoning challenges. By bridging the gap in graduate-level benchmarks, MSEarth provides a scalable and high-fidelity resource to enhance the development and evaluation of MLLMs in scientific reasoning. The benchmark is publicly available to foster further research and innovation in this field. Resources related to this benchmark can be found at https://huggingface.co/MSEarth and https://github.com/xiangyu-mm/MSEarth.
摘要
多模态大语言模型(MLLMs)的快速发展为解决复杂科学问题提供了新机遇。然而,其在地球科学领域(尤其是研究生层面)的应用仍待探索,主要障碍在于缺乏能够体现地球科学推理深度与语境复杂性的基准测试。现有基准多依赖合成数据集或简化的图文配对,无法充分反映实际科学应用所需的复杂推理与领域洞见。为填补这一空白,我们推出MSEarth——一个基于高质量开放获取科学文献构建的多模态科学基准。该基准涵盖地球科学五大圈层(大气圈、冰冻圈、水圈、岩石圈和生物圈),包含7,000余幅配图及精炼标注。这些标注源自原始图注并融合论文中的讨论与推理,确保基准能捕捉高级科学任务所需的细微推理与知识密集型内容。MSEarth支持科学配图标注、多选题及开放式推理挑战等多种任务。通过弥补研究生级基准的空白,该资源为科学推理中MLLMs的开发与评估提供了可扩展的高保真工具。基准数据集已公开以促进相关研究创新,资源详见https://huggingface.co/MSEarth与https://github.com/xiangyu-mm/MSEarth。
MT-Mol: Multi Agent System with Tool-based Reasoning for Molecular Optimization
Abstract
arXiv:2505.20820v1 Announce Type: new Abstract: Large language models (LLMs) have great potential for molecular optimization, as they can gather external chemistry tools and enable collaborative interactions to iteratively refine molecular candidates. However, this potential remains underexplored, particularly in the context of structured reasoning, interpretability, and comprehensive tool-grounded molecular optimization. To address this gap, we introduce MT-Mol, a multi-agent framework for molecular optimization that leverages tool-guided reasoning and role-specialized LLM agents. Our system incorporates comprehensive RDKit tools, categorized into five distinct domains: structural descriptors, electronic and topological features, fragment-based functional groups, molecular representations, and miscellaneous chemical properties. Each category is managed by an expert analyst agent, responsible for extracting task-relevant tools and enabling interpretable, chemically grounded feedback. MT-Mol produces molecules with tool-aligned, stepwise reasoning through the interaction between the analyst agents, a molecule-generating scientist, a reasoning-output verifier, and a reviewer agent. As a result, our framework achieves state-of-the-art performance on 17 of the 23 tasks in the PMO-1K benchmark.
摘要
大语言模型(LLMs)在分子优化领域具有巨大潜力,因其能够整合外部化学工具并通过协同交互实现候选分子的迭代优化。然而,这种潜力尤其在结构化推理、可解释性以及基于工具的综合分子优化方面尚未得到充分探索。为填补这一空白,我们提出了MT-Mol——一个基于工具引导推理与角色专业化LLM代理的多智能体分子优化框架。该系统整合了全面的RDKit工具集,并将其划分为五个独立领域:结构描述符、电子与拓扑特征、基于片段的官能团、分子表示以及杂项化学性质。每个领域由专业分析代理负责,其任务是提取任务相关工具并提供可解释的、基于化学原理的反馈。MT-Mol通过分析代理、分子生成科学家、推理输出验证器和评审代理之间的交互,产生具有工具对齐和逐步推理特性的分子。实验结果表明,我们的框架在PMO-1K基准测试的23项任务中有17项达到了当前最优性能。
Can Agents Fix Agent Issues?
Abstract
arXiv:2505.20749v1 Announce Type: new Abstract: LLM-based agent systems are emerging as a new software paradigm and have been widely adopted across diverse domains such as medicine, robotics, and programming. However, maintaining these systems requires substantial effort, as they are inevitably prone to bugs and continually evolve to meet changing external requirements. Therefore, automatically resolving agent issues (i.e., bug reports or feature requests) is a crucial and challenging task. While recent software engineering (SE) agents (e.g., SWE-agent) have shown promise in addressing issues in traditional software systems, it remains unclear how effectively they can resolve real-world issues in agent systems, which differ significantly from traditional software. To fill this gap, we first manually analyze 201 real-world agent issues and identify common categories of agent issues. We then spend 500 person-hours constructing AGENTISSUE-BENCH, a reproducible benchmark comprising 50 agent issue resolution tasks (each with an executable environment and failure-triggering tests). We further evaluate state-of-the-art SE agents on AGENTISSUE-BENCH and reveal their limited effectiveness (i.e., with only 3.33% - 12.67% resolution rates). These results underscore the unique challenges of maintaining agent systems compared to traditional software, highlighting the need for further research to develop advanced SE agents for resolving agent issues. Data and code are available at https://alfin06.github.io/AgentIssue-Bench-Leaderboard/#/ .
摘要
基于大语言模型的智能体系统正在成为一种新兴的软件范式,并已广泛应用于医疗、机器人和编程等多个领域。然而,维护这些系统需要大量投入,因为它们不可避免地存在缺陷,且需要持续演进以满足不断变化的外部需求。因此,自动解决智能体问题(即错误报告或功能需求)成为一项关键而具有挑战性的任务。尽管近期软件工程智能体(如SWE-agent)在解决传统软件系统问题方面展现出潜力,但其处理智能体系统中实际问题的有效性尚不明确,因为这类系统与传统软件存在显著差异。为填补这一空白,我们首先人工分析了201个真实场景中的智能体问题,识别出常见问题类别。随后投入500人时构建了AGENTISSUE-BENCH——一个包含50项智能体问题解决任务(每个任务均配备可执行环境及触发失败的测试用例)的可复现基准测试平台。我们进一步评估了当前最先进的软件工程智能体在该平台上的表现,发现其解决效率有限(仅3.33%-12.67%的解决率)。这些结果凸显了智能体系统维护相较于传统软件的独特挑战,表明需要进一步研究开发更先进的软件工程智能体来解决智能体问题。数据与代码详见https://alfin06.github.io/AgentIssue-Bench-Leaderboard/#/。
E2E Process Automation Leveraging Generative AI and IDP-Based Automation Agent: A Case Study on Corporate Expense Processing
Abstract
arXiv:2505.20733v1 Announce Type: new Abstract: This paper presents an intelligent work automation approach in the context of contemporary digital transformation by integrating generative AI and Intelligent Document Processing (IDP) technologies with an Automation Agent to realize End-to-End (E2E) automation of corporate financial expense processing tasks. While traditional Robotic Process Automation (RPA) has proven effective for repetitive, rule-based simple task automation, it faces limitations in handling unstructured data, exception management, and complex decision-making. This study designs and implements a four-stage integrated process comprising automatic recognition of supporting documents such as receipts via OCR/IDP, item classification based on a policy-driven database, intelligent exception handling supported by generative AI (large language models, LLMs), and human-in-the-loop final decision-making with continuous system learning through an Automation Agent. Applied to a major Korean enterprise (Company S), the system demonstrated quantitative benefits including over 80% reduction in processing time for paper receipt expense tasks, decreased error rates, and improved compliance, as well as qualitative benefits such as enhanced accuracy and consistency, increased employee satisfaction, and data-driven decision support. Furthermore, the system embodies a virtuous cycle by learning from human judgments to progressively improve automatic exception handling capabilities. Empirically, this research confirms that the organic integration of generative AI, IDP, and Automation Agents effectively overcomes the limitations of conventional automation and enables E2E automation of complex corporate processes. The study also discusses potential extensions to other domains such as accounting, human resources, and procurement, and proposes future directions for AI-driven hyper-automation development.
摘要
本文提出了一种在当代数字化转型背景下的智能工作自动化方法,通过将生成式人工智能(AI)与智能文档处理(IDP)技术结合自动化代理(Automation Agent),实现企业财务费用处理任务的端到端(E2E)自动化。传统机器人流程自动化(RPA)虽在重复性、基于规则的简单任务自动化方面成效显著,但在处理非结构化数据、异常管理和复杂决策方面存在局限。本研究设计并实施了一个四阶段集成流程:通过OCR/IDP自动识别收据等证明文件、基于政策驱动数据库的项目分类、生成式AI(大语言模型LLM)支持的智能异常处理,以及人机协同最终决策与自动化代理持续学习的闭环系统。在韩国某大型企业(S公司)的应用表明,该系统实现了纸质收据费用任务处理时间减少80%以上、错误率降低与合规性提升等量化效益,以及准确性一致性增强、员工满意度提高和数据驱动决策支持等质性效益。系统通过从人工判断中学习,形成自动异常处理能力持续改进的良性循环。实证研究证实,生成式AI、IDP与自动化代理的有机结合能有效突破传统自动化局限,实现复杂企业流程的端到端自动化。研究还探讨了该方法在会计、人力资源和采购等领域的扩展潜力,并提出了AI驱动超自动化发展的未来方向。
MedSentry: Understanding and Mitigating Safety Risks in Medical LLM Multi-Agent Systems
Abstract
arXiv:2505.20824v1 Announce Type: new Abstract: As large language models (LLMs) are increasingly deployed in healthcare, ensuring their safety, particularly within collaborative multi-agent configurations, is paramount. In this paper, we introduce MedSentry, a benchmark comprising 5,000 adversarial medical prompts spanning 25 threat categories with 100 subthemes. Coupled with this dataset, we develop an end-to-end attack-defense evaluation pipeline to systematically analyze how four representative multi-agent topologies (Layers, SharedPool, Centralized, and Decentralized) withstand attacks from 'dark-personality' agents. Our findings reveal critical differences in how these architectures handle information contamination and maintain robust decision-making, exposing their underlying vulnerability mechanisms. For instance, SharedPool's open information sharing makes it highly susceptible, whereas Decentralized architectures exhibit greater resilience thanks to inherent redundancy and isolation. To mitigate these risks, we propose a personality-scale detection and correction mechanism that identifies and rehabilitates malicious agents, restoring system safety to near-baseline levels. MedSentry thus furnishes both a rigorous evaluation framework and practical defense strategies that guide the design of safer LLM-based multi-agent systems in medical domains.
摘要
随着大型语言模型(LLMs)在医疗领域日益广泛应用,确保其安全性——尤其在协作多智能体配置中——变得至关重要。本文提出MedSentry基准测试集,包含涵盖25个威胁类别、100个子主题的5000条对抗性医疗提示。结合该数据集,我们开发了端到端的攻防评估流程,系统分析四种代表性多智能体拓扑结构(层级式、共享池式、集中式和分布式)如何抵御"暗黑人格"智能体的攻击。研究发现这些架构在应对信息污染和保持稳健决策方面存在关键差异,暴露出其底层脆弱机制。例如,共享池式架构因开放信息共享而极易受攻击,而分布式架构凭借固有冗余和隔离展现出更强韧性。为降低风险,我们提出基于人格量表的检测校正机制,可识别并修复恶意智能体,使系统安全性恢复至接近基线水平。MedSentry不仅提供严谨的评估框架,还提出实用防御策略,为医疗领域基于LLM的多智能体系统设计更安全的方案。
Research on a Two-Layer Demand Response Framework for Electric Vehicle Users and Aggregators Based on LLMs
Abstract
arXiv:2505.20877v1 Announce Type: new Abstract: The widespread adoption of electric vehicles (EVs) has increased the importance of demand response in smart grids. This paper proposes a two-layer demand response optimization framework for EV users and aggregators, leveraging large language models (LLMs) to balance electricity supply and demand and optimize energy utilization during EV charging. The upper-layer model, focusing on the aggregator, aims to maximize profits by adjusting retail electricity prices. The lower-layer model targets EV users, using LLMs to simulate charging demands under varying electricity prices and optimize both costs and user comfort. The study employs a multi-threaded LLM decision generator to dynamically analyze user behavior, charging preferences, and psychological factors. The framework utilizes the PSO method to optimize electricity prices, ensuring user needs are met while increasing aggregator profits. Simulation results show that the proposed model improves EV charging efficiency, alleviates peak power loads, and stabilizes smart grid operations.
摘要
电动汽车(EV)的广泛普及提升了智能电网中需求响应的重要性。本文提出一种面向EV用户与聚合商的双层需求响应优化框架,利用大语言模型(LLM)平衡充电过程中的电力供需并优化能源利用。上层模型聚焦聚合商视角,通过调整零售电价实现利润最大化;下层模型针对EV用户,采用LLM模拟不同电价下的充电需求,优化成本与用户舒适度。研究通过多线程LLM决策生成器动态分析用户行为、充电偏好及心理因素,运用粒子群优化(PSO)方法进行电价优化,在满足用户需求的同时提升聚合商收益。仿真结果表明,该模型能有效提升EV充电效率、缓解电网峰值负荷并稳定智能电网运行。
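The two-layer idea can be illustrated with a toy example: an upper layer that searches for a profit-maximizing retail price with particle swarm optimization (PSO), and a lower layer that returns charging demand at that price. This is only a minimal sketch under stated assumptions. The linear `simulated_demand` function is a hypothetical stand-in for the LLM-simulated user responses, and the wholesale cost, price bounds, and PSO coefficients are illustrative values that do not come from the paper.

```python
import random


def simulated_demand(price):
    """Hypothetical lower layer: charging demand (kWh) falls linearly with price.

    In the paper this role is played by LLM-simulated EV users; the linear
    form here is an assumption made purely for illustration.
    """
    return max(0.0, 100.0 - 40.0 * price)


def aggregator_profit(price, wholesale_cost=0.5):
    """Upper-layer objective: margin per kWh times the demand at that price."""
    return (price - wholesale_cost) * simulated_demand(price)


def pso_maximize(objective, lo, hi, n_particles=20, n_iters=100, seed=0):
    """One-dimensional PSO that maximizes `objective` over [lo, hi]."""
    rng = random.Random(seed)
    pos = [rng.uniform(lo, hi) for _ in range(n_particles)]
    vel = [0.0] * n_particles
    best_pos = pos[:]                          # personal best positions
    best_val = [objective(p) for p in pos]     # personal best values
    g = max(range(n_particles), key=lambda i: best_val[i])
    g_pos, g_val = best_pos[g], best_val[g]    # global best so far
    w, c1, c2 = 0.7, 1.5, 1.5                  # inertia, cognitive, social weights
    for _ in range(n_iters):
        for i in range(n_particles):
            r1, r2 = rng.random(), rng.random()
            vel[i] = (w * vel[i]
                      + c1 * r1 * (best_pos[i] - pos[i])
                      + c2 * r2 * (g_pos - pos[i]))
            pos[i] = min(hi, max(lo, pos[i] + vel[i]))  # clamp to the price bounds
            val = objective(pos[i])
            if val > best_val[i]:
                best_pos[i], best_val[i] = pos[i], val
                if val > g_val:
                    g_pos, g_val = pos[i], val
    return g_pos, g_val


price, profit = pso_maximize(aggregator_profit, lo=0.5, hi=2.5)
print(f"price={price:.3f}, profit={profit:.2f}")
```

For this toy objective the optimum can be checked by hand: the profit (p - 0.5)(100 - 40p) peaks at p = 1.5 with a value of 40, so the search result should land close to that point.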
Step-Wise Formal Verification for LLM-Based Mathematical Problem Solving
Abstract
arXiv:2505.20869v1 Announce Type: new Abstract: Large Language Models (LLMs) have demonstrated formidable capabilities in solving mathematical problems, yet they may still commit logical reasoning and computational errors during the problem-solving process. Thus, this paper proposes a framework, MATH-VF, which includes a Formalizer and a Critic, for formally verifying the correctness of the solutions generated by large language models. Our framework first utilizes a Formalizer which employs an LLM to translate a natural language solution into a formal context. Afterward, our Critic (which integrates various external tools such as a Computer Algebra System and an SMT solver) evaluates the correctness of each statement within the formal context, and when a statement is incorrect, our Critic provides corrective feedback. We empirically investigate the effectiveness of MATH-VF in two scenarios: 1) Verification: MATH-VF is utilized to determine the correctness of a solution to a given problem. 2) Refinement: When MATH-VF identifies errors in the solution generated by an LLM-based solution generator for a given problem, it submits the corrective suggestions proposed by the Critic to the solution generator to regenerate the solution. We evaluate our framework on widely used mathematical benchmarks: MATH500 and ProcessBench, demonstrating the superiority of our approach over existing approaches.
摘要
大语言模型(LLMs)在解决数学问题方面展现出强大能力,但其求解过程仍可能出现逻辑推理与计算错误。为此,本文提出MATH-VF框架,通过整合形式化转换器(Formalizer)与验证器(Critic)来实现对大语言模型生成解法的形式化验证。该框架首先利用基于LLM的形式化转换器将自然语言解法转换为形式化表述,随后由验证器(集成计算机代数系统、SMT求解器等外部工具)对形式化语境中的每个陈述进行正确性评估。当发现错误陈述时,验证器将生成修正反馈。我们通过实证研究验证MATH-VF在两种场景下的有效性:1)验证场景:用于判定给定问题解法的正确性;2)优化场景:当LLM解法生成器针对给定问题产生错误解时,将验证器提出的修正建议反馈至生成器以重新生成解法。我们在广泛使用的数学基准测试集MATH500和ProcessBench上评估本框架,实验结果证明该方法优于现有技术方案。
Agent-Environment Alignment via Automated Interface Generation
Abstract
arXiv:2505.21055v1 Announce Type: new Abstract: Large language model (LLM) agents have shown impressive reasoning capabilities in interactive decision-making tasks. These agents interact with the environment through intermediate interfaces, such as predefined action spaces and interaction rules, which mediate perception and action. However, mismatches often arise between the agent's internal expectations about the effects of its actions and the actual state transitions in the environment, a phenomenon referred to as agent-environment misalignment. While prior work has invested substantially in improving agent strategies and environment design, the critical role of the interface remains underexplored. In this work, we empirically demonstrate that agent-environment misalignment poses a significant bottleneck to agent performance. To mitigate this issue, we propose ALIGN, an Auto-Aligned Interface Generation framework that alleviates the misalignment by enriching the interface. Specifically, the ALIGN-generated interface enhances both the static information of the environment and the step-wise observations returned to the agent. Implemented as a lightweight wrapper, this interface achieves the alignment without modifying either the agent logic or the environment code. Experiments across multiple domains, including embodied tasks, web navigation, and tool use, show consistent performance improvements, with up to a 45.67% success rate improvement observed in ALFWorld. Meanwhile, the ALIGN-generated interface can generalize across different agent architectures and LLM backbones without interface regeneration. Code and experimental results are available at https://github.com/THUNLP-MT/ALIGN.
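The "lightweight wrapper" idea can be sketched as follows: the wrapper prepends static information at reset and appends an explanatory hint to step-wise observations, while the environment and the agent stay unchanged. This is a minimal sketch, not the ALIGN implementation. `ToyEnv`, `InterfaceWrapper`, and the hand-written `explain` rule are hypothetical names introduced for illustration; in the paper the interface is generated automatically rather than written by hand.

```python
class ToyEnv:
    """Hypothetical environment: 'take key' only works after 'open drawer'."""

    def __init__(self):
        self.drawer_open = False

    def reset(self):
        self.drawer_open = False
        return "You see a closed drawer."

    def step(self, action):
        if action == "open drawer":
            self.drawer_open = True
            return "The drawer is open."
        if action == "take key" and self.drawer_open:
            return "You take the key."
        return "Nothing happens."


class InterfaceWrapper:
    """Enriches observations without touching the environment or the agent."""

    def __init__(self, env, static_info, explain):
        self.env = env
        self.static_info = static_info  # rules shown once, at reset
        self.explain = explain          # maps (env, action, obs) to a hint or None

    def reset(self):
        return f"{self.static_info}\n{self.env.reset()}"

    def step(self, action):
        obs = self.env.step(action)
        hint = self.explain(self.env, action, obs)
        return f"{obs} {hint}" if hint else obs


def explain(env, action, obs):
    """Hand-written rule (an assumption) that explains one failure mode."""
    if obs == "Nothing happens." and action == "take key" and not env.drawer_open:
        return "Hint: the drawer must be opened before taking the key."
    return None


wrapped = InterfaceWrapper(ToyEnv(), "Rule: items inside containers need the container open.", explain)
print(wrapped.reset())
print(wrapped.step("take key"))
print(wrapped.step("open drawer"))
print(wrapped.step("take key"))
```

The agent sees the same action space as before; only the text it receives is richer, which is why such a wrapper can be reused across agents.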
Why Distillation can Outperform Zero-RL: The Role of Flexible Reasoning
Abstract
arXiv:2505.21067v1 Announce Type: new Abstract: Reinforcement learning (RL) has played an important role in improving the reasoning ability of large language models (LLMs). Some studies apply RL directly to smaller base models (known as zero-RL) and also achieve notable progress. However, in this paper, we show that using only 920 examples, a simple distillation method based on the base model can clearly outperform zero-RL, which typically requires much more data and computational cost. By analyzing the token frequency in model outputs, we find that the distilled model shows more flexible reasoning. It uses anthropomorphic tokens and logical connectors much more often than the zero-RL model. Further analysis reveals that distillation enhances the presence of two advanced cognitive behaviors: Multi-Perspective Thinking or Attempting and Metacognitive Awareness. Frequent occurrences of these two advanced cognitive behaviors give rise to flexible reasoning, which is essential for solving complex reasoning problems, while zero-RL fails to significantly boost the frequency of these behaviors.
摘要
强化学习(RL)在提升大语言模型(LLMs)的推理能力方面发挥了重要作用。现有研究直接将RL应用于较小规模的基础模型(称为零RL方法),也取得了显著进展。然而,本文研究表明,仅需920个样本,基于基础模型的简单蒸馏方法即可明显超越通常需要更多数据和计算成本的零RL方法。通过分析模型输出的词元频率,我们发现蒸馏模型展现出更灵活的推理能力:其使用拟人化词元和逻辑连接词的频率显著高于零RL模型。进一步分析表明,蒸馏方法增强了两种高级认知行为的出现频率:多视角思考/尝试以及元认知意识。这两种高级认知行为的频繁出现催生了灵活的推理能力,而这正是解决复杂推理问题的关键;而零RL方法则未能显著提升这些行为的出现频率。
Interpreting Social Bias in LVLMs via Information Flow Analysis and Multi-Round Dialogue Evaluation
Abstract
arXiv:2505.21106v1 Announce Type: new Abstract: Large Vision Language Models (LVLMs) have achieved remarkable progress in multimodal tasks, yet they also exhibit notable social biases. These biases often manifest as unintended associations between neutral concepts and sensitive human attributes, leading to disparate model behaviors across demographic groups. While existing studies primarily focus on detecting and quantifying such biases, they offer limited insight into the underlying mechanisms within the models. To address this gap, we propose an explanatory framework that combines information flow analysis with multi-round dialogue evaluation, aiming to understand the origin of social bias from the perspective of imbalanced internal information utilization. Specifically, we first identify high-contribution image tokens involved in the model's reasoning process for neutral questions via information flow analysis. Then, we design a multi-turn dialogue mechanism to evaluate the extent to which these key tokens encode sensitive information. Extensive experiments reveal that LVLMs exhibit systematic disparities in information usage when processing images of different demographic groups, suggesting that social bias is deeply rooted in the model's internal reasoning dynamics. Furthermore, we complement our findings from a textual modality perspective, showing that the model's semantic representations already display biased proximity patterns, thereby offering a cross-modal explanation of bias formation.
摘要
尽管大规模视觉语言模型(LVLMs)在多模态任务中取得了显著进展,但它们也表现出明显的社会偏见。这些偏见通常表现为中性概念与敏感人类属性之间的非预期关联,导致模型在不同人口统计群体中表现出差异化的行为。现有研究主要集中于检测和量化此类偏见,但对模型内部潜在机制的解释较为有限。为填补这一空白,我们提出一个结合信息流分析与多轮对话评估的解释性框架,旨在从内部信息利用失衡的角度理解社会偏见的起源。具体而言,我们首先通过信息流分析识别模型在回答中性问题时推理过程中涉及的高贡献图像标记;随后设计多轮对话机制评估这些关键标记对敏感信息的编码程度。大量实验表明,LVLMs在处理不同人口群体图像时存在系统性的信息使用差异,表明社会偏见深植于模型的内部推理动态中。此外,我们从文本模态角度补充研究发现,证明模型的语义表征已呈现有偏见的邻近模式,从而为偏见形成提供了跨模态解释。
Large Language Model-enhanced Reinforcement Learning for Low-Altitude Economy Networking
Abstract
arXiv:2505.21045v1 Announce Type: new Abstract: Low-Altitude Economic Networking (LAENet) aims to support diverse flying applications below 1,000 meters by deploying various aerial vehicles for flexible and cost-effective aerial networking. However, complex decision-making, resource constraints, and environmental uncertainty pose significant challenges to the development of the LAENet. Reinforcement learning (RL) offers a potential solution in response to these challenges but has limitations in generalization, reward design, and model stability. The emergence of large language models (LLMs) offers new opportunities for RL to mitigate these limitations. In this paper, we first present a tutorial about integrating LLMs into RL by using the capacities of generation, contextual understanding, and structured reasoning of LLMs. We then propose an LLM-enhanced RL framework for the LAENet in terms of serving the LLM as information processor, reward designer, decision-maker, and generator. Moreover, we conduct a case study by using LLMs to design a reward function to improve the learning performance of RL in the LAENet. Finally, we provide a conclusion and discuss future work.
摘要
低空经济网络(LAENet)旨在通过部署各类飞行器,在1000米以下空域为多样化飞行应用提供灵活且经济高效的空中组网服务。然而,复杂决策制定、资源限制及环境不确定性对LAENet的发展构成重大挑战。强化学习(RL)虽能应对这些挑战,但在泛化性、奖励函数设计和模型稳定性方面存在局限。大语言模型(LLM)的出现为缓解这些局限提供了新机遇。本文首先通过利用LLM的生成能力、上下文理解与结构化推理特性,提出将LLM与RL融合的教程框架;继而构建面向LAENet的LLM增强型RL框架,使LLM承担信息处理器、奖励函数设计器、决策生成器等多重角色。此外,我们通过案例研究验证了LLM设计奖励函数对提升LAENet中RL学习性能的有效性。最后总结研究结论并展望未来工作方向。
Diagnosing and Resolving Cloud Platform Instability with Multi-modal RAG LLMs
Abstract
arXiv:2505.21419v1 Announce Type: new Abstract: Today's cloud-hosted applications and services are complex systems, and a performance or functional instability can have dozens or hundreds of potential root causes. Our hypothesis is that by combining the pattern matching capabilities of modern AI tools with a natural multi-modal RAG LLM interface, problem identification and resolution can be simplified. ARCA is a new multi-modal RAG LLM system that targets this domain. Step-wise evaluations show that ARCA outperforms state-of-the-art alternatives.
摘要
当今云托管应用程序和服务是复杂系统,性能或功能不稳定可能由数十甚至数百种潜在根源引起。我们的假设是:通过将现代AI工具的模式匹配能力与自然多模态RAG大语言模型界面相结合,可以简化问题识别与解决过程。ARCA是一种面向该领域的新型多模态RAG大语言模型系统。阶段性评估表明,ARCA在性能上优于现有最先进方案。
The Multilingual Divide and Its Impact on Global AI Safety
Abstract
arXiv:2505.21344v1 Announce Type: new Abstract: Despite advances in large language model capabilities in recent years, a large gap remains in their capabilities and safety performance for many languages beyond a relatively small handful of globally dominant languages. This paper provides researchers, policymakers and governance experts with an overview of key challenges to bridging the "language gap" in AI and minimizing safety risks across languages. We provide an analysis of why the language gap in AI exists and grows, and how it creates disparities in global AI safety. We identify barriers to address these challenges, and recommend how those working in policy and governance can help address safety concerns associated with the language gap by supporting multilingual dataset creation, transparency, and research.
摘要
尽管近年来大语言模型能力取得进展,但在全球少数主流语言之外,大多数语言的模型能力与安全性能仍存在显著差距。本文为研究人员、政策制定者和治理专家系统阐述了弥合人工智能"语言鸿沟"及降低多语言安全风险的关键挑战。我们深入分析了AI语言鸿沟存在并扩大的根源,及其如何导致全球AI安全领域的失衡发展。通过识别应对这些挑战的主要障碍,我们为政策与治理工作者提出建议:通过支持多语言数据集构建、提升透明度及加强相关研究,来应对由语言鸿沟引发的安全隐患。
Large Language Models Miss the Multi-Agent Mark
Abstract
arXiv:2505.21298v1 Announce Type: new Abstract: Recent interest in Multi-Agent Systems of Large Language Models (MAS LLMs) has led to an increase in frameworks leveraging multiple LLMs to tackle complex tasks. However, much of this literature appropriates the terminology of MAS without engaging with its foundational principles. In this position paper, we highlight critical discrepancies between MAS theory and current MAS LLMs implementations, focusing on four key areas: the social aspect of agency, environment design, coordination and communication protocols, and measuring emergent behaviours. Our position is that many MAS LLMs lack multi-agent characteristics such as autonomy, social interaction, and structured environments, and often rely on oversimplified, LLM-centric architectures. The field may slow down and lose traction by revisiting problems the MAS literature has already addressed. Therefore, we systematically analyse this issue and outline associated research opportunities; we advocate for better integrating established MAS concepts and more precise terminology to avoid mischaracterisation and missed opportunities.
摘要
近期对大型语言模型多智能体系统(MAS LLMs)的关注,促使越来越多框架利用多个LLM处理复杂任务。然而,现有研究大多套用了MAS术语却未涉及其基础理论。本立场论文揭示了MAS理论与当前MAS LLMs实践间的关键差异,聚焦四个核心维度:智能体的社会属性、环境设计、协调与通信协议以及涌现行为测量。我们认为多数MAS LLMs缺乏自主性、社会交互和结构化环境等多智能体特征,往往依赖过度简化的LLM中心架构。若忽视MAS文献已解决的问题,该领域发展可能受阻并丧失潜力。为此,我们系统分析了这一现状并指出相关研究机遇,主张通过整合成熟的MAS概念和采用更精确的术语体系,避免误释并把握发展契机。
Out of the Past: An AI-Enabled Pipeline for Traffic Simulation from Noisy, Multimodal Detector Data and Stakeholder Feedback
Abstract
arXiv:2505.21349v1 Announce Type: new Abstract: How can a traffic simulation be designed to faithfully reflect real-world traffic conditions? Past data-driven approaches to traffic simulation in the literature have relied on unrealistic or suboptimal heuristics. They also fail to adequately account for the effects of uncertainty and multimodality in the data on simulation outcomes. In this work, we integrate advances in AI to construct a three-step, end-to-end pipeline for generating a traffic simulation from detector data: computer vision for vehicle counting from camera footage, combinatorial optimization for vehicle route generation from multimodal data, and large language models for iterative simulation refinement from natural language feedback. Using a road network from Strongsville, Ohio as a testbed, we demonstrate that our pipeline can accurately capture the city's traffic patterns in a granular simulation. Beyond Strongsville, our traffic simulation framework can be generalized to other municipalities with different levels of data and infrastructure availability.
摘要
如何设计一个能真实反映现实世界交通状况的交通仿真系统?以往文献中基于数据驱动的交通仿真方法往往依赖于不现实或次优的启发式规则,且未能充分考虑数据中的不确定性和多模态特性对仿真结果的影响。本研究整合人工智能领域的最新进展,构建了一个从检测器数据生成交通仿真的三步骤端到端流程:利用计算机视觉技术从监控视频中提取车辆计数,通过组合优化方法从多模态数据生成车辆路径,并运用大语言模型根据自然语言反馈进行迭代仿真优化。以俄亥俄州斯特朗斯维尔的道路网络为测试平台,我们证明该流程能够通过精细化仿真准确捕捉该城市的交通模式。该交通仿真框架可进一步推广至具有不同数据水平和基础设施条件的其他城市区域。
MME-Reasoning: A Comprehensive Benchmark for Logical Reasoning in MLLMs
Abstract
arXiv:2505.21327v1 Announce Type: new Abstract: Logical reasoning is a fundamental aspect of human intelligence and an essential capability for multimodal large language models (MLLMs). Despite the significant advancement in multimodal reasoning, existing benchmarks fail to comprehensively evaluate their reasoning abilities due to the lack of explicit categorization for logical reasoning types and an unclear understanding of reasoning. To address these issues, we introduce MME-Reasoning, a comprehensive benchmark designed to evaluate the reasoning ability of MLLMs, which covers all three types of reasoning (i.e., inductive, deductive, and abductive) in its questions. We carefully curate the data to ensure that each question effectively evaluates reasoning ability rather than perceptual skills or knowledge breadth, and extend the evaluation protocols to cover the evaluation of diverse questions. Our evaluation reveals substantial limitations of state-of-the-art MLLMs when subjected to holistic assessments of logical reasoning capabilities. Even the most advanced MLLMs show limited performance in comprehensive logical reasoning, with notable performance imbalances across reasoning types. In addition, we conducted an in-depth analysis of approaches such as "thinking mode" and Rule-based RL, which are commonly believed to enhance reasoning abilities. These findings highlight the critical limitations and performance imbalances of current MLLMs in diverse logical reasoning scenarios, providing comprehensive and systematic insights into the understanding and evaluation of reasoning capabilities.
摘要
逻辑推理是人类智能的核心要素,也是多模态大语言模型(MLLMs)的关键能力。尽管多模态推理研究取得了显著进展,但由于缺乏对逻辑推理类型的明确分类以及对推理本质的理解不足,现有基准测试难以全面评估模型的推理能力。为此,我们提出MME-Reasoning——一个全面评估MLLMs推理能力的基准测试,其问题涵盖归纳、演绎和溯因三类基本推理形式。我们通过严格的数据筛选确保每个问题都能有效评估推理能力而非感知技能或知识广度,并扩展评估协议以覆盖多样化问题的评测。实验表明,当对逻辑推理能力进行整体评估时,最先进的MLLMs仍存在显著局限:即便最先进的模型在综合逻辑推理中也表现有限,且不同推理类型间存在明显性能失衡。此外,我们对'思维模式'和基于规则的强化学习等常用推理增强方法进行了深入分析。这些发现揭示了当前MLLMs在多样化逻辑推理场景中的关键局限与性能失衡,为理解与评估推理能力提供了系统化的见解。
Beyond Chemical QA: Evaluating LLM's Chemical Reasoning with Modular Chemical Operations
Abstract
arXiv:2505.21318v1 Announce Type: new Abstract: While large language models (LLMs) with Chain-of-Thought (CoT) reasoning excel in mathematics and coding, their potential for systematic reasoning in chemistry, a domain demanding rigorous structural analysis for real-world tasks like drug design and reaction engineering, remains untapped. Current benchmarks focus on simple knowledge retrieval, neglecting step-by-step reasoning required for complex tasks such as molecular optimization and reaction prediction. To address this, we introduce ChemCoTBench, a reasoning framework that bridges molecular structure understanding with arithmetic-inspired operations, including addition, deletion, and substitution, to formalize chemical problem-solving into transparent, step-by-step workflows. By treating molecular transformations as modular "chemical operations", the framework enables slow-thinking reasoning, mirroring the logic of mathematical proofs while grounding solutions in real-world chemical constraints. We evaluate models on two high-impact tasks: Molecular Property Optimization and Chemical Reaction Prediction. These tasks mirror real-world challenges while providing structured evaluability. By providing annotated datasets, a reasoning taxonomy, and baseline evaluations, ChemCoTBench bridges the gap between abstract reasoning methods and practical chemical discovery, establishing a foundation for advancing LLMs as tools for AI-driven scientific innovation.
摘要
尽管具备思维链(CoT)推理能力的大语言模型(LLM)在数学和编程领域表现出色,但其在化学领域进行系统性推理的潜力尚未被发掘——该领域需要严格的分子结构分析以应对药物设计和反应工程等实际任务。现有基准测试主要关注简单知识检索,忽视了分子优化与反应预测等复杂任务所需的逐步推理能力。为此,我们提出ChemCoTBench推理框架,通过将分子结构理解与算术化操作(包括添加、删除和替换)相结合,将化学问题解决形式化为透明、分步骤的工作流程。该框架将分子转化视为模块化的"化学操作",支持慢思考推理模式,既遵循数学证明的逻辑,又将解决方案锚定于现实化学约束。我们在两个高影响力任务(分子性质优化与化学反应预测)上评估模型性能,这些任务既反映实际挑战又具备结构化可评估性。通过提供标注数据集、推理分类体系和基线评估结果,ChemCoTBench填补了抽象推理方法与实用化学发现之间的鸿沟,为推进LLM成为AI驱动科学创新的工具奠定基础。
Complex System Diagnostics Using a Knowledge Graph-Informed and Large Language Model-Enhanced Framework
Abstract
arXiv:2505.21291v1 Announce Type: new Abstract: In this paper, we present a novel diagnostic framework that integrates Knowledge Graphs (KGs) and Large Language Models (LLMs) to support system diagnostics in high-reliability systems such as nuclear power plants. Traditional diagnostic modeling struggles when systems become too complex, making functional modeling a more attractive approach. Our approach introduces a diagnostic framework grounded in the functional modeling principles of the Dynamic Master Logic (DML) model. It incorporates two coordinated LLM components, including an LLM-based workflow for automated construction of DML logic from system documentation and an LLM agent that facilitates interactive diagnostics. The generated logic is encoded into a structured KG, referred to as KG-DML, which supports hierarchical fault reasoning. Expert knowledge or operational data can also be incorporated to refine the model's precision and diagnostic depth. In the interaction phase, users submit natural language queries, which are interpreted by the LLM agent. The agent selects appropriate tools for structured reasoning, including upward and downward propagation across the KG-DML. Rather than embedding KG content into every prompt, the LLM agent distinguishes between diagnostic and interpretive tasks. For diagnostics, the agent selects and executes external tools that perform structured KG reasoning. For general queries, a Graph-based Retrieval-Augmented Generation (Graph-RAG) approach is used, retrieving relevant KG segments and embedding them into the prompt to generate natural explanations. A case study on an auxiliary feedwater system demonstrated the framework's effectiveness, with over 90% accuracy in key elements and consistent tool and argument extraction, supporting its use in safety-critical diagnostics.
摘要
本文提出了一种集成知识图谱(KGs)与大语言模型(LLMs)的新型诊断框架,用于支持核电站等高可靠性系统的故障诊断。当系统过于复杂时,传统诊断建模方法面临挑战,这使得功能建模成为更具吸引力的解决方案。我们的方法基于动态主逻辑(DML)模型的功能建模原理,构建了一个包含两个协同LLM组件的诊断框架:一个用于从系统文档自动构建DML逻辑的LLM工作流,以及一个支持交互式诊断的LLM智能体。生成的逻辑被编码为结构化知识图谱(KG-DML),支持分层故障推理。专家知识或运行数据可被纳入以提升模型精度和诊断深度。在交互阶段,用户提交自然语言查询,由LLM智能体解析后选择适当工具进行结构化推理(包括KG-DML的上下行传播)。该智能体区分诊断任务与解释任务:对于诊断任务,选择并执行外部工具进行结构化图谱推理;对于一般查询,采用基于图谱的检索增强生成(Graph-RAG)方法,检索相关图谱片段并嵌入提示词以生成自然语言解释。辅助给水系统的案例研究表明,该框架在关键要素上准确率超过90%,工具与参数提取结果稳定,验证了其在安全关键诊断中的适用性。
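Upward and downward propagation over a functional hierarchy can be sketched as plain graph reachability. This is a simplified illustration, not the KG-DML implementation: the node names and the `SUPPORTS` edge table are hypothetical, and the sketch ignores the gate logic and dynamic behavior that a real DML model would encode.

```python
from collections import deque

# Hypothetical hierarchy: each node lists the higher-level nodes it supports.
SUPPORTS = {
    "motor": ["pump"],
    "pump": ["feedwater_flow"],
    "valve": ["feedwater_flow"],
    "feedwater_flow": ["heat_removal"],
}


def invert(graph):
    """Build the reverse mapping: each node lists the nodes it depends on."""
    reverse = {}
    for child, parents in graph.items():
        for parent in parents:
            reverse.setdefault(parent, []).append(child)
    return reverse


def reach(graph, start):
    """Breadth-first search returning every node reachable from `start`."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def upward(node):
    """Which higher-level functions may be affected if `node` fails?"""
    return reach(SUPPORTS, node)


def downward(node):
    """Which lower-level elements could explain a degraded `node`?"""
    return reach(invert(SUPPORTS), node)


print(sorted(upward("motor")))
print(sorted(downward("feedwater_flow")))
```

In an agent setting, functions like `upward` and `downward` would be exposed as tools the LLM can call with extracted arguments, so that the structured reasoning runs outside the prompt.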
Autonomous Multi-Modal LLM Agents for Treatment Planning in Focused Ultrasound Ablation Surgery
Abstract
arXiv:2505.21418v1 Announce Type: new Abstract: Focused Ultrasound Ablation Surgery (FUAS) has emerged as a promising non-invasive therapeutic modality, valued for its safety and precision. Nevertheless, its clinical implementation entails intricate tasks such as multimodal image interpretation, personalized dose planning, and real-time intraoperative decision-making processes that demand intelligent assistance to improve efficiency and reliability. We introduce FUAS-Agents, an autonomous agent system that leverages the multimodal understanding and tool-using capabilities of large language models (LLMs). By integrating patient profiles and MRI data, FUAS-Agents orchestrates a suite of specialized medical AI tools, including segmentation, treatment dose prediction, and clinical guideline retrieval, to generate personalized treatment plans comprising MRI image, dose parameters, and therapeutic strategies. We evaluate the system in a uterine fibroid treatment scenario. Human assessment by four senior FUAS experts indicates that 82.5%, 82.5%, 87.5%, and 97.5% of the generated plans were rated 4 or above (on a 5-point scale) in terms of completeness, accuracy, fluency, and clinical compliance, respectively. These results demonstrate the potential of LLM-driven agents in enhancing decision-making across complex clinical workflows, and exemplify a translational paradigm that combines general-purpose models with specialized expert systems to solve practical challenges in vertical healthcare domains.
摘要
聚焦超声消融手术(FUAS)作为一种安全精准的无创治疗手段,已展现出显著临床应用前景。然而其实施过程涉及多模态影像解析、个性化剂量规划和实时术中决策等复杂任务,亟需智能辅助系统以提升效率与可靠性。本研究提出FUAS-Agents自主代理系统,通过整合大型语言模型(LLMs)的多模态理解与工具调用能力,协同患者资料与MRI数据,调度包括影像分割、治疗剂量预测和临床指南检索在内的专业医疗AI工具,生成涵盖MRI图像、剂量参数及治疗策略的个性化方案。在子宫肌瘤治疗场景的评估中,四位资深FUAS专家人工评审显示:生成方案在完整性、准确性、流畅性和临床合规性方面分别获得82.5%、82.5%、87.5%和97.5%的4分及以上评分(5分制)。该结果证实了LLM驱动代理在优化复杂临床决策流程方面的潜力,同时为通用模型与垂直领域专家系统的协同转化提供了范式,以解决医疗健康领域的实际挑战。
Policy Induction: Predicting Startup Success via Explainable Memory-Augmented In-Context Learning
Abstract
arXiv:2505.21427v1 Announce Type: new Abstract: Early-stage startup investment is a high-risk endeavor characterized by scarce data and uncertain outcomes. Traditional machine learning approaches often require large, labeled datasets and extensive fine-tuning, yet remain opaque and difficult for domain experts to interpret or improve. In this paper, we propose a transparent and data-efficient investment decision framework powered by memory-augmented large language models (LLMs) using in-context learning (ICL). Central to our method is a natural language policy embedded directly into the LLM prompt, enabling the model to apply explicit reasoning patterns and allowing human experts to easily interpret, audit, and iteratively refine the logic. We introduce a lightweight training process that combines few-shot learning with an in-context learning loop, enabling the LLM to update its decision policy iteratively based on structured feedback. With only minimal supervision and no gradient-based optimization, our system predicts startup success far more accurately than existing benchmarks. It is over 20x more precise than random chance, which succeeds 1.9% of the time. It is also 7.1x more precise than the typical 5.6% success rate of top-tier venture capital (VC) firms.
摘要
早期初创企业投资是一项高风险活动,其特点是数据稀缺且结果不确定。传统机器学习方法通常需要大量标注数据和精细调参,却仍缺乏透明度,难以让领域专家理解或改进。本文提出一种由记忆增强型大语言模型(LLM)驱动的透明高效投资决策框架,采用上下文学习(ICL)方法。我们的方法核心是将自然语言策略直接嵌入LLM提示中,使模型能够应用显式推理模式,并允许人类专家轻松解读、审核和迭代优化逻辑。我们引入一种轻量级训练流程,将小样本学习与上下文学习循环相结合,使LLM能够基于结构化反馈迭代更新决策策略。仅需极少量监督且无需基于梯度的优化,我们的系统对初创企业成功率的预测精度远超现有基准:其预测准确率比随机猜测(1.9%成功率)高出20倍以上,比顶级风险投资机构(VC)5.6%的平均成功率高出7.1倍。
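The two precision multiples quoted in the abstract can be cross-checked with a short calculation. The sketch below uses only the three figures stated in the abstract (1.9%, 5.6%, and 7.1x); the implied precision of roughly 40% is derived from them here and is not a number reported in the abstract itself.

```python
random_rate = 0.019   # success rate of random chance, from the abstract
vc_rate = 0.056       # typical top-tier VC success rate, from the abstract
lift_over_vc = 7.1    # reported multiple over the VC success rate

# Precision implied by "7.1x the VC rate" (derived, not reported).
implied_precision = lift_over_vc * vc_rate

# Multiple over random chance implied by that precision.
lift_over_random = implied_precision / random_rate

print(f"implied precision: {implied_precision:.1%}")
print(f"multiple over random chance: {lift_over_random:.1f}x")
```

The derived multiple over random chance is about 20.9, which agrees with the abstract's "over 20x" claim.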